AI-Based Electrical Component Identifier

Description
Electrical-component-themed AI detection and identification.

Hardware
A camera mounted above a surface (part of the product) produces a controlled-environment live feed for the application.

Software
An application running inference on a live USB camera feed (optionally an imported picture or video).

Augmentation
Modification of the provided data to simulate differences in the environment and to provide imperfections to train against. Examples: addition of glare, rotation, blurring, addition of spots.

GUI
The application is based on Qt Creator, using C++.

Inference
Running in C++, utilising Ultralytics YOLOv8.

Summary

Detection via inference
Detect and display boundaries for each identified class from the input image using inference.

Identification
Post-processing of the components in the bounding boxes detected by inference, which may carry additional information that can be identified by a variety of approaches. Examples: resistors (resistor code value), LEDs (LED color), IC components (pin count, information written on the component).

Features

Inference classes
Classes to train the model to detect: resistor, diode, capacitor, LED, integrated circuit, AC, DC, LDR.

Milestones
Base camera rig; initial inference model training; inference running; testing with video footage from a mobile device.

Research
Models: Ultralytics YOLO (YOLOv5 training, YOLOv8 training). Live labelling.

Set rig
The set position of the camera, the significant reduction in distance between the objects and the camera, the consistency of the lighting provided by the ring light, and the static background will boost the confidence of the inference considerably. Because the angle and lighting are both known and mostly fixed thanks to the set rig, the input dataset does not need to cover angles and lighting beyond what the rig will expose the model to during runtime.
Angle range
The angle range is reduced to a single top-down view, eliminating the rest of the range. Only the components being detected need to be captured at various angles; the camera gathering the dataset does not need to be repositioned. A top-down view also eliminates the majority of glare issues caused by high-luminosity bodies, such as clouds or the sun.

Lighting
While the lighting will change depending on room conditions, the ring light around the camera provides significant consistency. This does not eliminate the need to train against various lighting conditions, but it reduces their significance and increases the certainty of detection.

Static background
Apart from dust or unexpected objects on the rig's surface, which should be removed before use, the background behind the objects stays mostly consistent. This reduces the need to gather data of the same object against backgrounds that are not expected to appear during runtime. A set rig also significantly limits the distance between the objects and the camera during runtime, allowing further confidence in the predictions.

The sum of all the points covered above results in a significant reduction in the data required to train, when compared to a setup without a set rig, for equivalent confidence values during runtime.

Audience
While this project may be retrained and refocused for use in many different fields, it is trained for electrical component identification, which is aimed at engineers - both existing engineers and those interested in becoming engineers. The quick identification of components, a count of each, and any additional information provided by the project saves time otherwise spent analysing this information manually.
Identification approaches

Resistor color codes
The color codes can be identified by processing the image using filters and other techniques until only the prominent colors remain. The positions of the color bands relative to the body of the resistor can then be used to identify the specific positions and order of the codes, which can be processed into the actual ohm value.

IC pin count
The pin count can be identified by processing the image using filters until there is a clear contrast between the body of the chip and the pins. One approach is to draw a line between two of the pins and count how many pins touch this line; taking the line that touches the most pins provides the pin count of the IC. OCR may also be used for the information written on the package.

LED color
Inference method: different-color LEDs may be trained as individual classes. This has the disadvantage of requiring training for each individual LED color separately, as opposed to one generic LED class.
Algorithm method: has the advantage of working on any LED. The most prominent color may be identified by sorting all the colors from the image by their hue values and checking which hue is most active. This has the disadvantage of potentially giving false information if the background is too vibrant.
(Figure: raw input, high-contrast filter, color histogram - after filtering, a clearly prominent yellow remains.)

HSV contrast approach
HSV, or Hue Saturation Value, is an alternative way to represent colors, and can be advantageous over RGB in situations such as this. Take the average of the hue of all pixels whose value is above a certain threshold - around 0.7 on a range from 0 to 1 should be appropriate. Hue is in the range of 0 to 360 degrees. (Figure: the pink dots on the hue circle represent the pixel values obtained from the previous step.) Taking the average of this data, the result lands on a degree value that can easily be determined to be yellow, by separating the hue circle into color sections by degree ranges.
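As a concrete sketch of the threshold-then-average approach described above (a minimal pure-Python illustration; `dominant_hue` and its default threshold are hypothetical helpers, not part of the application code):

```python
import colorsys

def dominant_hue(pixels, value_threshold=0.7):
    """Average hue (degrees) of bright pixels; pixels are (r, g, b) in 0..255."""
    hues = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if v >= value_threshold:      # ignore pixels darker than the threshold
            hues.append(h * 360)      # colorsys returns hue in the range 0..1
    if not hues:
        return None                   # nothing bright enough to classify
    return sum(hues) / len(hues)      # simple average of the remaining hues
```

A plain average is adequate away from the 0°/360° wrap-around; hues near red would need a circular mean instead.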
Yellow is between 72° and 108° on the hue circle. Note: this example ignores colors that are darker than 0.7, on a range of 0 to 1.

Live labelling
The ability to take a snapshot of the current frame, define appropriate labels, and save this labelled snapshot for future training - all from inside the GUI. Alternatively, taking snapshots from the GUI and saving them for later labelling.

Milestones (sorted from highest priority to lowest)
Rig - setting up the camera on a rig.
Base GUI - a GUI with the essentials to interface with the camera through USB: a live display from the camera on the rig, the ability to take images by pressing a button, and support for running inference.
Initial dataset gathering - ~100 images of a single class, taken from the rig, for initial training and testing of the model.
Initial model training - for the purpose of testing inference on the rig; a proof of concept. The results will not be perfect, as the dataset is minimal and only contains one class. This training should not take long at all and does not need to be polished: training for ~100 epochs should be sufficient, with each epoch taking ~20 seconds on the machine available.
Further dataset gathering - at least 250 pictures of each class of every component that the project is designed to detect.
Further model training - this will take considerably longer than the initial training: around 2 minutes per epoch, run for at least 300 epochs.

Model training via deep learning - machine used
Personal computer.
CPU: AMD Ryzen 7 5800X3D - 8 cores, 16 threads, 3.4GHz base clock, 96MB L3 cache, 90°C maximum operating temperature.
GPU: GeForce RTX 3060 Ti - 8192MB GDDR6 memory, 4864 CUDA cores, 1.41GHz base clock.

The goal is to reach confidence values of 0.8 on a range from 0 to 1.

Post-processing
The ability to gather further information from the detection bounding boxes provided by the inference.

After the previous steps are in good shape, investigation of moving the inference to a mobile device will begin.
If the confidence values are not up to standard, more data will be gathered from this and potentially other mobile devices, followed by further training, until the results are adequate. If adequate results are achieved before the deadline of this project, deployment to a mobile device will be started. If the frame rates are not sufficient, inference may be run on still images instead, to improve the user experience. Optional: the ability to label the images from the device, without requiring external software. Given the timeframe of the project, it may be advantageous to instead gather data during a session and label it afterwards.

Hardware details
Memory: 2x16GB DDR4, 3.6GHz - Corsair Vengeance RGB PRO SL - https://www.corsair.com/eu/en/Categories/Products/Memory/Vengeance-RGB-PRO-SL-Black/p/CMH32GX4M2E3200C16
CPU: AMD Ryzen 7 5800X3D - https://www.amd.com/en/products/cpu/amd-ryzen-7-5800x3d
GPU: NVIDIA GeForce 30 series, RTX 3060 Ti - https://www.nvidia.com/en-gb/geforce/graphics-cards/30-series/rtx-3060-3060ti/

CUDA
CUDA cores are special cores designed for compute-intensive tasks. They run in parallel with the CPU, and may also run in parallel across multiple GPUs. They are perfect for deep learning, as deep learning is incredibly compute-intensive. Deep-learning training times are predictable and stay mostly constant between epochs; there are no race conditions, so the more processing power is available, the quicker each epoch finishes.

Each of these steps should be polished before continuing to the next one, to provide a solid foundation for the next step to be based on.

Analysis of the models

Brief history
YOLO, which stands for You Only Look Once, is a popular image segmentation and object detection model that was originally developed by Joseph Redmon and Ali Farhadi.
YOLOv1 - the first version was released in 2015, and very quickly became popular due to its significantly superior speed and accuracy when compared to other architectures.
YOLOv4 - released in 2020, introducing Mosaic data augmentation and a new, improved loss function - decreasing the time taken to achieve good results for the trained model.
YOLOv5 - released in 2020, introducing support for object tracking - which allows following a moving object - and panoptic segmentation, which allows identification of overlapping objects with accurate bounding boxes.
Ultralytics YOLOv8 - the latest version of YOLO as of today. YOLOv8 is a state-of-the-art model that builds upon the already very successful previous YOLO versions, introducing new performance and flexibility features. It retains full support for previous YOLO versions, making it incredibly convenient for existing users of those versions to take advantage of the new features.

Version comparison
In general, YOLOv8 is superior to all of its predecessors. While YOLOv5 mostly underperforms when compared to the later versions, it is important to note how minimal the delays are even on a version that is now outdated. YOLO offers pretrained models that are used as the starting point for training custom models. Each model has its advantages and disadvantages, and should be picked depending on the project.

mAP - single-model, single-scale values while detecting on the COCO val2017 dataset.
Speed - average time taken per image, measured using an Amazon EC2 P4d instance on the COCO dataset.
Size - the pixel height and width the model operates up to.
Params (in millions) - the number of parameters that are tweaked per epoch while training, and processed during inference.
FLOPs - floating-point operations; a measure of the compute required per inference pass, relevant in the field of deep learning.

Diminishing returns can be observed in the mAP values when compared to the time taken (speed).
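That diminishing-returns trend can be checked numerically. The figures below are approximate mAP and parameter counts recalled from the Ultralytics YOLOv8 model table, included only as an illustrative assumption (consult the official table for exact values):

```python
# (model, approximate COCO mAP 50-95, approximate params in millions)
# NOTE: illustrative values only - see the Ultralytics model table for exact figures.
MODELS = [("yolov8n", 37.3, 3.2), ("yolov8s", 44.9, 11.2),
          ("yolov8m", 50.2, 25.9), ("yolov8l", 52.9, 43.7),
          ("yolov8x", 53.9, 68.2)]

def map_gain_per_million_params(models):
    """Marginal mAP gained per extra million parameters at each size step."""
    return [(m1 - m0) / (p1 - p0)
            for (_, m0, p0), (_, m1, p1) in zip(models, models[1:])]
```

Each step up the size ladder buys less accuracy per added parameter - the trade-off behind picking a mid-sized model.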
Model properties
In some circumstances, maximum precision is essential and is prioritised over the hardware requirements; this is when a larger model should be chosen. In the scope of this project, the YOLOv8m model has been chosen. The rationale behind this choice is to take advantage of the high mAP value without increasing the time taken too much, in preparation for a future mobile deployment of the model. Comparing the YOLOv5 and YOLOv8 versions, a clear advantage can be seen when taking into account the size of the model (parameter count), the resulting mAP output, and the time taken.

Architecture choice
YOLO has been chosen as the architecture that this project utilises for AI detection. At the start of the project there was already a strong bias towards YOLO, due to highly positive past experience with YOLOv5 and the features it offers. Upon the release of YOLOv8, with its superior features and specifications on top of the previous versions, YOLOv8 became the obvious choice of architecture for the project. As the name suggests, YOLO focuses on detecting multiple classes in a single "look" - a single analysis of the entire input image. Many earlier architectures instead reanalyse the entire image for every single class the model was trained for, increasing the time taken per detection additively per class; no matter how quick such architectures may be per pass, the single-look approach is vastly superior. An approach like this may seem too good to be true - as if it should come at a significant cost to the speed and confidence of the model - but when the results are analysed, that could barely be further from the truth: YOLO is an incredibly efficient and accurate architecture.
These days, most sophisticated architectures approach object detection similarly to YOLO, but YOLO is still a state-of-the-art architecture that continues to improve and grow to this day.

Internal AI object detection steps
Classification - the identification of a part of an image believed to contain an item of a class the model was trained to detect.
Object detection - the bounding by a box of the classified segments of the image.
Segmentation - the process of identifying the exact boundary of the detected item.
(Visual examples.)

Augmentation techniques include resizing and the joining up of multiple images to create new ones. The reduction in the data required to train makes it feasible to train relatively high-quality models from data gathered and trained from home.

Marking codes
Resistors and inductors: color coded. Capacitors and ICs: number coded.

Hardware

Microprocessors: Raspberry Pi 4, Beaglebone, Nvidia Jetson Nano.

USB computation extensions: Intel Neural Compute Stick 2 - specifications: processor base frequency 700MHz; memory 2GB; core count (SHAVE) 16. Advantage: offers computational power through a USB connection - it can be used to run inference on existing devices, such as a laptop.

Android phone - specifications depend on the specific device. Widely and easily accessible: the vast majority of mobile phones on the market today have a built-in camera.

Training and inference hardware
Training and inference use the CPU cores and the CUDA cores provided by the GPU.

Personal computer vs rented dedicated server

Rented dedicated server
Advantages: cloud-based, allowing for parallel computing, as opposed to using your personal computer at home; utilises multiple GPUs - quicker epoch computations, resulting in quicker training.
Disadvantages: cloud-based upload and download times - datasets tend to be considerably big in size; a smaller dataset of ~2000 images takes up ~3GB of space.
This is not a significant amount of data for a local machine to transfer, but it is a considerable amount for uploading. Cost: the bigger the server, the higher the rates become.

Personal computer
Advantages: local - provided a local machine is already owned, it is immediately available; pictures are taken on the machine itself, so there are no upload/download times. Cost: as opposed to a rented server, acquiring your own machine has the benefit of owning it and being able to use it indefinitely (or until it eventually breaks). While the initial cost of acquiring an adequate machine for deep learning is higher than renting a server for a few months, it is a worthwhile long-term investment in a machine that can be used for a variety of casual or intensive tasks.
Disadvantages: speed - when compared to a sophisticated server that runs many GPUs, a local machine will most likely process the training at a slower rate, as it will likely contain one, maybe two GPUs.

Devices discussion
After the training is done - which usually spans tens, and sometimes hundreds, of hours, depending on the size of the dataset and the epoch count - running the trained model for inference only takes milliseconds to process a single frame.

R-CNN
R-CNN, which stands for Region-Based Convolutional Neural Networks, was released in 2013 and was developed by Ross Girshick. Like other object detection architectures, R-CNN takes an input image and outlines bounding boxes where it believes an item of a certain class is present.
Disadvantages: not real-time - on average, it takes 47 seconds to process a single frame.
Discussion: it should be noted that R-CNN has successors called Fast R-CNN and Faster R-CNN; however, even the fastest of these still barely manages 5 frames per second at best. While 5 frames per second is an impressive and definitely useable result, there are alternative architectures that offer a significant improvement in inference time.

SSD
SSD stands for Single Shot Detector.
SSD was released in 2017, developed mostly by Max deGroot and Ellis Brown.
Discussion: offers great frame rates - an average of 45 frames per second when tested on an NVIDIA GTX 1060, a now relatively old graphics card.
Disadvantages: according to the Git repository, the project was seemingly abandoned about 4 years ago.

Labelling software
Existing labelling-related software offers quality-of-life features, such as rough auto-labelling of the images, which only requires the user to adjust the bounding boxes and confirm their validity, rather than having to define the boxes from start to finish.

Technology utilised
Deep-learning computation with CPU cores and GPU CUDA cores running in parallel.

220Ω resistor example - color codes: red = 2, brown = 1, gold = 5% tolerance.
100nF capacitor example.

IC markings
Unfortunately for the purposes of automatic identification of integrated circuit markings, most IC manufacturers do not follow any global standard for marking their ICs; most tend to have their own internal marking standards. Due to this, only known markings can be used to identify components.
Mixed-manufacturer ICs example: this example illustrates the vast variation in markings, and the lack of identifiable information without access to datasheets.

Progress - YOLOv5 architecture
Issues encountered: a glitch in the augmentation provided by YOLOv5, where rotation during augmentation shifted the bounding boxes of the components, causing inaccurate feedback to the model and preventing it from training appropriately.
Actual bounding boxes after rotation augmentation: note the unnecessarily expanded bounding boxes.
Submitted GitHub issue: https://github.com/ultralytics/yolov5/issues/10639
Information gathered from replies as of today's date: this issue has been reported to be part of YOLOv7 augmentation as well.
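Geometrically, the expansion seen in the issue is what happens when an axis-aligned box is rotated and a new axis-aligned box is fitted around the rotated corners, rather than around the object itself. A minimal sketch (`rotated_aabb` is a hypothetical helper, not YOLOv5 code):

```python
import math

def rotated_aabb(w, h, degrees):
    """Axis-aligned box enclosing a w x h box rotated by the given angle."""
    t = math.radians(degrees)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    # Enclosing width/height: w*|cos t| + h*|sin t| and w*|sin t| + h*|cos t|
    return w * c + h * s, w * s + h * c
```

For example, a 100x50 label rotated by 45° must be enclosed by a roughly 106x106 box - larger than the component in both dimensions, matching the expanded boxes observed.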
Expected bounding boxes after rotation augmentation: note the snug fit of the bounding box around the edges of the component. That is desirable, as it provides accurate information on what the model should be looking for. Expanded boxes train the model in undesirable ways, detecting parts it should not.

Platforms
The application is designed through Qt Creator, which is cross-platform (Windows/Linux/Mac).

Desktop/laptop machine
Specifications depend on the specific device; the specs of a desktop/laptop machine will most likely beat the specs of both a phone and a microprocessor.
Ease of access: desktops are widely accessible in environments where it would be relevant to use this project, such as the home of the user, or the campus a student is on. The machine must have permissions for USB connections and for running the application.

Android phone
There are countless types of Android devices on the market, all with varying specifications, including the camera. Most people own a mobile device and have it on them in most cases. The app may be obtained from an app store, which mobile devices have easy access to as long as they have access to the internet.

USB computation extensions
Ease of access: as the Intel Neural Compute Stick 2 is specialised for neural computations, it is not a common device by any means. Combined with its price tag of ~100 EUR, this device will likely only be owned by developers, as opposed to users. As it is unlikely to be owned by a user of the project, it would not be wise to require owning one to run our inference; instead, the project will support a compute stick as an alternative to a GPU.

Microprocessors
How long inference takes is directly tied to the speed of the hardware that the model is being run on, and to the size of the model.
Even with all the speed optimisations offered by the YOLO family, a lower-end device such as a Raspberry Pi 4 may take 1-2 seconds to process a single 360p image. It is important to pick appropriate hardware for your particular use case.

Raspberry Pi 4 - CPU: 4 cores, 1.5GHz maximum frequency.
Beaglebone - CPU: 1 core, 1GHz maximum frequency; GPU: 2 cores, 532MHz.
Nvidia Jetson Nano - CPU: 4 cores, 1.479GHz maximum frequency; GPU: 128 cores, 921MHz. Discussion: this microprocessor is targeted towards quick graphical computations, which can instead be used for deep learning.

Conclusion
Despite the additional strains, the project is currently on track and is following the planned milestones in sequence. The original project concept, at the time of the project proposition, has had a slight shift in focus.

Problem background
Detection of objects from an image is difficult and unreliable using algorithmic approaches: a computer does not know the difference between pixels representing a cat and pixels representing a dog. There are various cases in which automating object detection, as opposed to having a human constantly observing footage, is beneficial.

Algorithmic object detection
Algorithm-based detection can be used to effectively identify very specific criteria, which can be expressed as an analytical value or a trend. It can be implemented to reliably detect specific color-based properties, and specific shapes by following line trends (these must be coherent shapes; it does not perform well with partial shapes). Where algorithm-based detection will not be reliable: identification of generic objects, such as trees, cats, dogs, people, and cars, and of specific patterns.

AI-based object detection: deep learning
Deep learning is achieved through what are known as neural networks - a complex combination of usually many millions of digital neurons, with analog-based values for each neuron.
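As a toy illustration of such a digital neuron (a hypothetical stand-alone sketch, far smaller than anything in a real detection network): each neuron computes a weighted sum of its inputs plus a bias, squashed by an activation function.

```python
import math

def neuron(inputs, weights, bias):
    """One digital neuron: weighted sum of inputs, then a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))   # output squashed into (0, 1)
```

Millions of such units, connected in layers, make up the networks discussed here; training adjusts the weights and biases.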
These digital neurons work together to identify the incoming information and produce an output that resembles what the network has learned during the training of the model.

Epoch
The process of tweaking the entire model based on the existing parameters and the current output that it produces - by use of a loss function, randomness, and clever technique - in the hope of improving the detection ability on the data the model was trained upon.

Training
Initially, the model parameters are set to a random state; the output produced when the model is fed input data is, of course, also random. Epochs are then run on the model to train it, and the best model is kept between the newly trained model and the previous best. Many epochs are run in order to polish the model as much as possible. Training is usually based on a pre-trained model that was trained on a big dataset; most pre-trained models are trained on the COCO dataset, which is publicly available and holds a vast amount of data.

Inference
In object detection, inference is the utilisation of the model to process classifications of objects in the image.

Classification
The process of using the steps provided in the model to identify objects from an input image, through steps that depend on the architecture used. The identified objects are marked with a bounding box, the class they belong to, and a confidence value for the prediction.

Because of how far ahead the YOLO architecture is when compared to most other architectures, it is utilised very commonly throughout object detection projects.
(Internal steps of You Only Look Once inference.)

Shift in focus
The project was originally intended to focus on a more familiar field - more in the theme of a high-precision tool, which requires a set rig to push beyond the usual limitations and produce high-quality results. It has since shifted towards a more flexible, but less precise, concept, with the consideration of more generic use through mobile deployment.
As mobile deployment of object detection had not been explored, this has put additional strain on the timeline. Great results in object detection have now been achieved through the set rig; however, that is not quite the case outside of a set rig. The model has now been trained on over 3000 images, taken and labelled over the course of several months. At this rate, a significant increase in the dataset would be required for adequate results outside of a set rig, which would require a significant amount of additional time investment. This is, of course, not a viable option without neglecting other parts of the project.

A transition from YOLOv5 to YOLOv8 is currently being considered. As of right now, YOLOv8 only has experimental versions of some features that are essential to the success of this project; these are currently being tested to see whether they will be adequate enough to justify upgrading the current architecture. Overall, YOLOv8 has shown a considerable increase in confidence values.

After great confidence value results are achieved, the next milestone is the image post-processing.
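As a sketch of what that post-processing milestone could look like for resistors, the standard 4-band EIA color code turns detected band colors into an ohm value (`decode_resistor` is a hypothetical helper; the 220Ω example earlier in this report uses red, red, brown, gold):

```python
# Standard EIA color-to-digit table for 4-band resistors.
DIGITS = {"black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
          "green": 5, "blue": 6, "violet": 7, "grey": 8, "white": 9}
TOLERANCE = {"brown": 1.0, "red": 2.0, "gold": 5.0, "silver": 10.0}

def decode_resistor(digit1, digit2, multiplier, tolerance):
    """Return (ohms, tolerance percent) for a 4-band resistor."""
    ohms = (DIGITS[digit1] * 10 + DIGITS[digit2]) * 10 ** DIGITS[multiplier]
    return ohms, TOLERANCE[tolerance]
```

Here red = 2 as a digit and brown = 1 as the multiplier exponent, matching the earlier example: red, red, brown, gold gives 22 x 10^1 = 220Ω at ±5% tolerance.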